Aidan Henbest
Dr. Bixler
Data Science
26 January 2023
This data set contains 61,069 mushrooms described by 21 attributes: class, cap diameter, cap shape, cap surface, cap color, does bruise or bleed, gill attachment, gill spacing, gill color, stem height, stem width, stem root, stem surface, stem color, veil type, veil color, has ring, ring type, spore print color, habitat, and season. It was retrieved from the Philipps-Universität Marburg website. Although some columns are missing many values and the categories are not evenly distributed, the large total number of mushrooms means that every category still contains many observations. The data set was chosen because it has an extensive number of rows and a wide variety of columns for machine learning algorithms to learn from. There are no ethical considerations for using this data set, as it does not include data on anything that would affect humans or the environment. Included here is a diagram of a basic mushroom to show what many of the terms referenced in this analysis mean:

Link to dataset: https://mushroom.mathematik.uni-marburg.de/
# Import statements
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
import matplotlib.pyplot as plt
from matplotlib import cm as cmap
from sklearn.preprocessing import StandardScaler
import warnings
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import confusion_matrix
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
# Filters warnings
warnings.filterwarnings('ignore')
# Allows pandas to print everything without ellipses
pd.set_option('display.max_rows', None, 'display.max_columns', None)
# Set Style
sns.set()
# Creates the data frame
df = pd.read_csv('./data/mushrooms/secondary_data_generated.csv', sep = ';')
# Shows the name of each column, number of entries in each column, and the data type of each column for each data frame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61069 entries, 0 to 61068
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   class                 61069 non-null  object 
 1   cap-diameter          61069 non-null  float64
 2   cap-shape             61069 non-null  object 
 3   cap-surface           46949 non-null  object 
 4   cap-color             61069 non-null  object 
 5   does-bruise-or-bleed  61069 non-null  object 
 6   gill-attachment       51185 non-null  object 
 7   gill-spacing          36006 non-null  object 
 8   gill-color            61069 non-null  object 
 9   stem-height           61069 non-null  float64
 10  stem-width            61069 non-null  float64
 11  stem-root             9531 non-null   object 
 12  stem-surface          22945 non-null  object 
 13  stem-color            61069 non-null  object 
 14  veil-type             3177 non-null   object 
 15  veil-color            7413 non-null   object 
 16  has-ring              61069 non-null  object 
 17  ring-type             58598 non-null  object 
 18  spore-print-color     6354 non-null   object 
 19  habitat               61069 non-null  object 
 20  season                61069 non-null  object 
dtypes: float64(3), object(18)
memory usage: 9.8+ MB
# Shows the memory used by each column for each data frame
df.memory_usage(deep = True)
Index                        128
class                    3542002
cap-diameter              488552
cap-shape                3542002
cap-surface              3174882
cap-color                3542002
does-bruise-or-bleed     3542002
gill-attachment          3285018
gill-spacing             2890364
gill-color               3542002
stem-height               488552
stem-width                488552
stem-root                2202014
stem-surface             2550778
stem-color               3542002
veil-type                2036810
veil-color               2146946
has-ring                 3542002
ring-type                3477756
spore-print-color        2119412
habitat                  3542002
season                   3542002
dtype: int64
# Shows the number of values missing in each column for each data frame
df.isna().sum()
class                     0
cap-diameter              0
cap-shape                 0
cap-surface           14120
cap-color                 0
does-bruise-or-bleed      0
gill-attachment        9884
gill-spacing          25063
gill-color                0
stem-height               0
stem-width                0
stem-root             51538
stem-surface          38124
stem-color                0
veil-type             57892
veil-color            53656
has-ring                  0
ring-type              2471
spore-print-color     54715
habitat                   0
season                    0
dtype: int64
# Shows the first 5 rows of the data frame
df.head()
| | class | cap-diameter | cap-shape | cap-surface | cap-color | does-bruise-or-bleed | gill-attachment | gill-spacing | gill-color | stem-height | stem-width | stem-root | stem-surface | stem-color | veil-type | veil-color | has-ring | ring-type | spore-print-color | habitat | season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | 13.83 | x | h | e | f | e | NaN | w | 18.05 | 18.08 | s | y | w | u | w | t | p | NaN | d | w |
| 1 | p | 16.92 | x | g | o | f | e | NaN | w | 18.70 | 18.10 | s | y | w | u | w | t | g | NaN | d | u |
| 2 | p | 15.92 | x | g | o | f | e | NaN | w | 17.86 | 18.65 | s | y | w | u | w | t | g | NaN | d | u |
| 3 | p | 15.73 | f | g | o | f | e | NaN | w | 16.82 | 17.71 | s | y | w | u | w | t | p | NaN | d | u |
| 4 | p | 13.84 | x | h | o | f | e | NaN | w | 18.07 | 18.49 | s | y | w | u | w | t | g | NaN | d | w |
# Performs a basic statistical analysis on the data frame
df.describe()
| | cap-diameter | stem-height | stem-width |
|---|---|---|---|
| count | 61069.000000 | 61069.000000 | 61069.000000 |
| mean | 6.746893 | 6.588775 | 12.155013 |
| std | 5.262972 | 3.362591 | 9.989620 |
| min | 0.410000 | 0.000000 | 0.000000 |
| 25% | 3.490000 | 4.640000 | 5.200000 |
| 50% | 5.890000 | 5.960000 | 10.180000 |
| 75% | 8.540000 | 7.760000 | 16.600000 |
| max | 61.580000 | 35.790000 | 100.830000 |
# Shows the number of values in each category of each column of the data frame
for col in ['class', 'cap-shape', 'cap-surface', 'cap-color', 'does-bruise-or-bleed',
            'gill-attachment', 'gill-spacing', 'gill-color', 'stem-root', 'stem-surface',
            'stem-color', 'veil-type', 'veil-color', 'has-ring', 'ring-type',
            'spore-print-color', 'habitat', 'season']:
    print(df[col].value_counts())
    print()
p    33888
e    27181
Name: class, dtype: int64

x    26802
f    13492
s     7099
b     5697
o     3472
p     2685
c     1822
Name: cap-shape, dtype: int64

t    8133
s    7577
y    6396
h    5029
g    4740
d    4440
e    2590
k    2283
i    2198
w    2151
l    1412
Name: cap-surface, dtype: int64

n    24438
y     8487
w     7592
g     4363
e     4027
o     3625
r     1804
u     1783
p     1652
b     1258
k     1223
l      817
Name: cap-color, dtype: int64

f    50479
t    10590
Name: does-bruise-or-bleed, dtype: int64

a    12676
d    10269
x     7413
p     6001
e     5648
s     5648
f     3530
Name: gill-attachment, dtype: int64

c    24710
d     7766
f     3530
Name: gill-spacing, dtype: int64

w    18617
n     9770
y     9419
p     6023
g     4087
f     3530
o     2924
k     2355
r     1385
e     1013
u     1012
b      934
Name: gill-color, dtype: int64

s    3177
b    3177
r    1412
f    1059
c     706
Name: stem-root, dtype: int64

s    5977
y    4938
i    4401
t    2657
g    1765
k    1609
f    1059
h     539
Name: stem-surface, dtype: int64

w    22943
n    18079
y     7788
g     2656
o     2221
e     2039
u     1488
f     1059
p     1034
k      831
r      532
l      239
b      160
Name: stem-color, dtype: int64

u    3177
Name: veil-type, dtype: int64

w    5485
n     525
y     516
u     353
k     353
e     181
Name: veil-color, dtype: int64

f    45890
t    15179
Name: has-ring, dtype: int64

f    48361
e     2468
z     2118
r     1425
l     1404
g     1239
p     1230
m      353
Name: ring-type, dtype: int64

k    2102
w    1237
p    1234
n    1059
g     353
r     186
u     183
Name: spore-print-color, dtype: int64

d    44348
g     7825
l     3117
m     2972
h     1963
w      353
p      350
u      141
Name: habitat, dtype: int64

a    30150
u    22857
w     5333
s     2729
Name: season, dtype: int64
# Creates a heatmap of the correlation in the data frame
plt.figure(figsize = (12, 10))
# numeric_only restricts the correlation to the three float columns (required in newer pandas)
sns.heatmap(df.corr(numeric_only = True), cmap = cmap.PiYG, annot = True, center = 0)
plt.title('Correlation', fontsize = 16)
plt.xlabel('Column', fontsize = 14)
plt.ylabel('Column', fontsize = 14)
# Creates a copy of the data frame
X = df.copy()
# Makes the categorical data in the data frame copy into numeric data
columns = ['class', 'cap-shape', 'cap-surface', 'cap-color', 'does-bruise-or-bleed',
'gill-attachment', 'gill-spacing', 'gill-color', 'stem-root',
'stem-surface', 'stem-color', 'veil-type', 'veil-color', 'has-ring',
'ring-type', 'spore-print-color', 'habitat', 'season']
for i in columns:
X[i] = pd.Categorical(X[i])
X[i] = X[i].cat.codes
# Splits the answers off from the rest of the data frame
y = X.pop('does-bruise-or-bleed')
# Creates the training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Scales the training and testing datasets
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Shows the first 10 rows of the data frame
X.head(10)
| | class | cap-diameter | cap-shape | cap-surface | cap-color | gill-attachment | gill-spacing | gill-color | stem-height | stem-width | stem-root | stem-surface | stem-color | veil-type | veil-color | has-ring | ring-type | spore-print-color | habitat | season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 13.83 | 6 | 3 | 1 | 2 | -1 | 10 | 18.05 | 18.08 | 4 | 7 | 11 | 0 | 4 | 1 | 5 | -1 | 0 | 3 |
| 1 | 1 | 16.92 | 6 | 2 | 6 | 2 | -1 | 10 | 18.70 | 18.10 | 4 | 7 | 11 | 0 | 4 | 1 | 2 | -1 | 0 | 2 |
| 2 | 1 | 15.92 | 6 | 2 | 6 | 2 | -1 | 10 | 17.86 | 18.65 | 4 | 7 | 11 | 0 | 4 | 1 | 2 | -1 | 0 | 2 |
| 3 | 1 | 15.73 | 2 | 2 | 6 | 2 | -1 | 10 | 16.82 | 17.71 | 4 | 7 | 11 | 0 | 4 | 1 | 5 | -1 | 0 | 2 |
| 4 | 1 | 13.84 | 6 | 3 | 6 | 2 | -1 | 10 | 18.07 | 18.49 | 4 | 7 | 11 | 0 | 4 | 1 | 2 | -1 | 0 | 3 |
| 5 | 1 | 12.25 | 2 | 3 | 6 | 2 | -1 | 10 | 17.42 | 17.63 | 4 | 7 | 11 | 0 | 4 | 1 | 2 | -1 | 0 | 3 |
| 6 | 1 | 14.27 | 2 | 3 | 1 | 2 | -1 | 10 | 18.19 | 17.24 | 4 | 7 | 11 | 0 | 4 | 1 | 5 | -1 | 0 | 3 |
| 7 | 1 | 15.44 | 2 | 3 | 1 | 2 | -1 | 10 | 16.80 | 17.47 | 4 | 7 | 11 | 0 | 4 | 1 | 2 | -1 | 0 | 2 |
| 8 | 1 | 13.11 | 6 | 2 | 1 | 2 | -1 | 10 | 16.86 | 17.76 | 4 | 7 | 11 | 0 | 4 | 1 | 5 | -1 | 0 | 3 |
| 9 | 1 | 16.90 | 2 | 3 | 6 | 2 | -1 | 10 | 18.55 | 19.13 | 4 | 7 | 11 | 0 | 4 | 1 | 5 | -1 | 0 | 2 |
Preprocessing of the data consisted of importing packages, changing settings, identifying flaws in the data, performing some basic analyses, and formatting the data for machine learning. After all of the packages were imported, some basic settings were configured. Warnings were suppressed so that future warnings do not clutter this analysis. Ellipses were disabled so the entire data frame is shown when it is printed in this notebook. The style of the future plots was also set.
After this, the info command was used to see the data types of the columns. All of the columns contain strings except three, which contain floats. Then, the memory usage command was used to show how much memory each column uses. The final step of identifying flaws was determining how many values were missing in each column. While there are many missing values, this should not severely impact the analysis because there are so many rows in total.
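The severity of the gaps is easier to judge as a percentage of each column. A minimal sketch on a toy frame (the values here are made up for illustration; the real analysis would call this on the df loaded above):

```python
import pandas as pd
import numpy as np

# Toy stand-in for the mushroom frame, with deliberate gaps
df = pd.DataFrame({
    'cap-diameter': [13.83, 16.92, 15.92, 15.73],
    'veil-type':    ['u', np.nan, np.nan, np.nan],
    'ring-type':    ['t', 't', np.nan, 't'],
})

# Fraction of missing values per column, expressed as a percentage
missing_pct = df.isna().mean().mul(100).round(1)
print(missing_pct)
```

On the real data this would show, for example, that veil-type is missing in roughly 95% of rows, which is worth knowing before feeding the column to a model.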
Next, the head command was used to view the data frame, which appeared as expected. The describe function was then applied; however, only the three float columns could be analyzed: cap diameter, stem height, and stem width. The averages for these numeric values were 6.75 cm, 6.59 cm, and 12.16 mm, respectively. After this, value counts were computed for all of the non-numeric columns. Many of the columns are unevenly distributed; in particular, the spore print color and habitat variables had very few rows for some values. The final piece of analysis was a heat map of the correlations between the numeric variables. Cap diameter and stem width are strongly correlated, but stem height is only weakly correlated with both stem width and cap diameter.
The last step of the initial data analysis was creating a copy of the data frame in which all of the categorical data was converted to numeric codes so it can be used in machine learning algorithms. The answer column (does-bruise-or-bleed) was then split off from the main data frame so it would be kept separate when training. Next, the data was split into training and testing sets: eighty percent of the data is used for training and twenty percent for testing. Lastly, the training and testing sets were standardized with StandardScaler, which rescales each feature to zero mean and unit variance, and the encoded data frame was printed out using the head function.
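The encode, split, and scale steps above can be sketched end to end on a toy frame; the column names mirror the mushroom data, but the values are made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame with one categorical feature, one numeric feature, and the target
data = pd.DataFrame({
    'cap-shape':            ['x', 'f', 'x', 'b', 'x', 'f', 'b', 'x', 'f', 'b'],
    'cap-diameter':         [13.8, 16.9, 15.9, 15.7, 13.8, 12.2, 14.3, 15.4, 13.1, 16.9],
    'does-bruise-or-bleed': ['f', 't', 'f', 'f', 't', 'f', 't', 'f', 'f', 't'],
})

# Categorical columns -> integer codes (missing values become -1)
for col in ['cap-shape', 'does-bruise-or-bleed']:
    data[col] = pd.Categorical(data[col]).codes

# Split the answer column off, then make an 80/20 train/test split
y = data.pop('does-bruise-or-bleed')
X_train, X_test, y_train, y_test = train_test_split(
    data, y, test_size = 0.2, random_state = 0)

# Standardize: fit on the training set only, then apply to both
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print(X_train.shape, X_test.shape)
```

Fitting the scaler on the training set alone, as in the analysis above, keeps information from the test set from leaking into training.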
# Creates a scatter plot and a line of best fit from the data frame
plt.figure(figsize = (12, 10))
sns.regplot(x = 'cap-diameter', y = 'stem-height', data = df, scatter = False, color = '#d01c8b')
sns.scatterplot(x = 'cap-diameter', y = 'stem-height', data = df, s = 1, color = '#4dac26')
plt.title('Cap Diameter versus Stem Height', fontsize = 16)
plt.xlabel('Cap Diameter (cm)', fontsize = 14)
plt.ylabel('Stem Height (cm)', fontsize = 14)
# Creates a scatter plot and a line of best fit from the data frame
plt.figure(figsize = (12, 10))
sns.regplot(x = 'cap-diameter', y = 'stem-width', data = df, scatter = False, color = '#d01c8b')
sns.scatterplot(x = 'cap-diameter', y = 'stem-width', data = df, s = 1, color = '#4dac26')
plt.title('Cap Diameter versus Stem Width', fontsize = 16)
plt.xlabel('Cap Diameter (cm)', fontsize = 14)
plt.ylabel('Stem Width (mm)', fontsize = 14)
As mentioned before, cap diameter has a stronger correlation with stem width than with stem height; however, there is a clear positive correlation for both. A scatter plot with a line of best fit was created for each pair of variables, and both show a clear positive slope. There are some outliers in both graphs: some mushrooms have small cap diameters but oddly large stem widths/heights, and inversely, some have large cap diameters but oddly small stem widths/heights. Furthermore, there are some obvious groupings in both graphs. In particular, some mushrooms do not have stems at all and therefore have a stem width and height of zero. These mushrooms may have interfered with the correlation table, but given how few of them there are, it is unlikely they affected it much.
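The stemless group mentioned above can be counted directly with a boolean mask. A sketch on toy rows (on the real data, df would be the frame loaded earlier):

```python
import pandas as pd

# Toy rows illustrating the stemless group (stem height and width both zero)
df = pd.DataFrame({
    'stem-height': [18.05, 0.0, 5.96, 0.0],
    'stem-width':  [18.08, 0.0, 10.18, 0.0],
})

# Select rows where both stem measurements are exactly zero
stemless = df[(df['stem-height'] == 0) & (df['stem-width'] == 0)]
print(len(stemless))
```

Counting these rows would confirm whether the stemless mushrooms are rare enough to ignore when interpreting the correlations.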
# Creates a scatter plot and a line of best fit from the data frame
plt.figure(figsize = (12, 10))
sns.regplot(x = 'stem-height', y = 'stem-width', data = df, scatter = False, color = '#d01c8b')
sns.scatterplot(x = 'stem-height', y = 'stem-width', data = df, s = 1, color = '#4dac26')
plt.title('Stem Height versus Stem Width', fontsize = 16)
plt.xlabel('Stem Height (cm)', fontsize = 14)
plt.ylabel('Stem Width (mm)', fontsize = 14)
It is clear that the correlation table created previously was correct: there is only a weak correlation between stem width and stem height. A scatter plot was created to display these points with a line of best fit overlaid on top, and the points are scattered widely with little regularity. Some points appear to form groups, which may indicate that they come from the same species, but this cannot be determined definitively. Despite the weak positive correlation, mushrooms can clearly have thick but short stems or thin but long stems.
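The strength of this relationship is the same Pearson coefficient the heat map was built from, and it can be computed for a single pair of columns directly. A sketch on toy numbers (the real call would be df['stem-height'].corr(df['stem-width']) on the loaded frame):

```python
import pandas as pd

# Toy values standing in for the two stem measurements
df = pd.DataFrame({'stem-height': [1.0, 2.0, 3.0, 4.0],
                   'stem-width':  [2.0, 1.5, 3.5, 3.0]})

# Pearson correlation between the two columns
r = df['stem-height'].corr(df['stem-width'])
print(round(r, 3))
```

A value near zero would match the scattered cloud in the plot; a value near one would match the tight cap-diameter/stem-width trend seen earlier.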
# Creates a bar plot from the data frame
plt.figure(figsize = (12, 10))
sns.barplot(x = 'cap-shape', y = 'cap-diameter', data = df, palette = 'PiYG')
plt.title('Cap Shape versus Cap Diameter', fontsize = 16)
plt.xlabel('Cap Shape', fontsize = 14)
plt.ylabel('Cap Diameter (cm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4, 5, 6], ['Convex', 'Flat', 'Spherical', 'Bell', 'Conical', 'Sunken', 'Others'])
# Creates a box and strip plot from the data frame
plt.figure(figsize = (12, 10))
sns.stripplot(x = 'cap-shape', y = 'cap-diameter', data = df, color = 'k', alpha = 0.01)
sns.boxplot(x = 'cap-shape', y = 'cap-diameter', data = df, palette = 'PiYG', showfliers = False)
plt.title('Cap Shape versus Cap Diameter', fontsize = 16)
plt.xlabel('Cap Shape', fontsize = 14)
plt.ylabel('Cap Diameter (cm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4, 5, 6], ['Convex', 'Flat', 'Spherical', 'Bell', 'Conical', 'Sunken', 'Others'])
# Creates a violin plot from the data frame
plt.figure(figsize = (12, 10))
sns.violinplot(x = 'cap-shape', y = 'cap-diameter', data = df, palette = 'PiYG', inner = None)
plt.title('Cap Shape versus Cap Diameter', fontsize = 16)
plt.xlabel('Cap Shape', fontsize = 14)
plt.ylabel('Cap Diameter (cm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4, 5, 6], ['Convex', 'Flat', 'Spherical', 'Bell', 'Conical', 'Sunken', 'Others'])
# Creates a bar plot from the data frame
plt.figure(figsize = (12, 10))
sns.barplot(x = 'cap-surface', y = 'cap-diameter', data = df, palette = 'PiYG')
plt.title('Cap Surface versus Cap Diameter', fontsize = 16)
plt.xlabel('Cap Surface', fontsize = 14)
plt.ylabel('Cap Diameter (cm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], ['Shiny', 'Grooves', 'Sticky', 'Scaly', 'Fleshy', 'Smooth', 'Leathery', 'd', 'Wrinkled', 'Fibrous', 'Silky'])
# Creates a box and strip plot from the data frame
plt.figure(figsize = (12, 10))
sns.stripplot(x = 'cap-surface', y = 'cap-diameter', data = df, color = 'k', alpha = 0.01)
sns.boxplot(x = 'cap-surface', y = 'cap-diameter', data = df, palette = 'PiYG', showfliers = False)
plt.title('Cap Surface versus Cap Diameter', fontsize = 16)
plt.xlabel('Cap Surface', fontsize = 14)
plt.ylabel('Cap Diameter (cm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], ['Shiny', 'Grooves', 'Sticky', 'Scaly', 'Fleshy', 'Smooth', 'Leathery', 'd', 'Wrinkled', 'Fibrous', 'Silky'])
# Creates a violin plot from the data frame
plt.figure(figsize = (12, 10))
sns.violinplot(x = 'cap-surface', y = 'cap-diameter', data = df, palette = 'PiYG', inner = None)
plt.title('Cap Surface versus Cap Diameter', fontsize = 16)
plt.xlabel('Cap Surface', fontsize = 14)
plt.ylabel('Cap Diameter (cm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], ['Shiny', 'Grooves', 'Sticky', 'Scaly', 'Fleshy', 'Smooth', 'Leathery', 'd', 'Wrinkled', 'Fibrous', 'Silky'])
Cap shape was compared to cap diameter using a bar plot, a box plot with an overlaid strip plot, and a violin plot. From these plots it is clear that the others category had both the highest average cap diameter and the highest maximum cap diameter. The spherical shape had the second highest average cap diameter but only the third highest maximum. The flat shape overtook the spherical shape in the maximum comparison despite having only the fourth highest average. The sunken shape had the third highest average and fifth highest maximum, the convex shape the fifth highest average and third highest maximum, the conical shape the sixth highest average and seventh highest maximum, and the bell shape the seventh highest average and sixth highest maximum.
Cap surface was compared to cap diameter using a bar plot, a box plot with an overlaid strip plot, and a violin plot. From these plots it is clear that the fleshy surface had the highest average cap diameter but only the ninth highest maximum. The remaining surface types, ordered by average, are: scaly, silky, smooth, d, sticky, shiny, fibrous, wrinkled, leathery, grooves. Ordered by maximum, they are: scaly, smooth, d, wrinkled, grooves, sticky, silky, shiny, fibrous, leathery.
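The average-based orderings read off the bar plots can be verified numerically with a group-by. A sketch on a toy frame (on the real data the same call would group the full df by 'cap-shape' or 'cap-surface'):

```python
import pandas as pd

# Toy frame with three cap shapes and made-up diameters
df = pd.DataFrame({
    'cap-shape':    ['x', 'x', 'f', 'f', 'b', 'b'],
    'cap-diameter': [6.0, 8.0, 5.0, 7.0, 3.0, 4.0],
})

# Mean cap diameter per cap shape, largest first -- the same ranking
# the bar plots display visually
means = (df.groupby('cap-shape')['cap-diameter']
           .mean()
           .sort_values(ascending = False))
print(means)
```

Replacing mean() with max() would reproduce the maximum-based orderings discussed above.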
# Creates a bar plot from the data frame
plt.figure(figsize = (12, 10))
sns.barplot(x = 'stem-root', y = 'stem-height', data = df, palette = 'PiYG')
plt.title('Stem Root versus Stem Height', fontsize = 16)
plt.xlabel('Stem Root', fontsize = 14)
plt.ylabel('Stem Height (cm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4], ['Swollen', 'Bulbous', 'Rooted', 'Club', 'f'])
# Creates a box and strip plot from the data frame
plt.figure(figsize = (12, 10))
sns.stripplot(x = 'stem-root', y = 'stem-height', data = df, color = 'k', alpha = 0.01)
sns.boxplot(x = 'stem-root', y = 'stem-height', data = df, palette = 'PiYG', showfliers = False)
plt.title('Stem Root versus Stem Height', fontsize = 16)
plt.xlabel('Stem Root', fontsize = 14)
plt.ylabel('Stem Height (cm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4], ['Swollen', 'Bulbous', 'Rooted', 'Club', 'f'])
# Creates a violin plot from the data frame
plt.figure(figsize = (12, 10))
sns.violinplot(x = 'stem-root', y = 'stem-height', data = df, palette = 'PiYG', inner = None)
plt.title('Stem Root versus Stem Height', fontsize = 16)
plt.xlabel('Stem Root', fontsize = 14)
plt.ylabel('Stem Height (cm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4], ['Swollen', 'Bulbous', 'Rooted', 'Club', 'f'])
# Creates a bar plot from the data frame
plt.figure(figsize = (12, 10))
sns.barplot(x = 'stem-root', y = 'stem-width', data = df, palette = 'PiYG')
plt.title('Stem Root versus Stem Width', fontsize = 16)
plt.xlabel('Stem Root', fontsize = 14)
plt.ylabel('Stem Width (mm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4], ['Swollen', 'Bulbous', 'Rooted', 'Club', 'f'])
# Creates a box and strip plot from the data frame
plt.figure(figsize = (12, 10))
sns.stripplot(x = 'stem-root', y = 'stem-width', data = df, color = 'k', alpha = 0.01)
sns.boxplot(x = 'stem-root', y = 'stem-width', data = df, palette = 'PiYG', showfliers = False)
plt.title('Stem Root versus Stem Width', fontsize = 16)
plt.xlabel('Stem Root', fontsize = 14)
plt.ylabel('Stem Width (mm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4], ['Swollen', 'Bulbous', 'Rooted', 'Club', 'f'])
# Creates a violin plot from the data frame
plt.figure(figsize = (12, 10))
sns.violinplot(x = 'stem-root', y = 'stem-width', data = df, palette = 'PiYG', inner = None)
plt.title('Stem Root versus Stem Width', fontsize = 16)
plt.xlabel('Stem Root', fontsize = 14)
plt.ylabel('Stem Width (mm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4], ['Swollen', 'Bulbous', 'Rooted', 'Club', 'f'])
Stem root was compared to stem height using a bar plot, a box plot with an overlaid strip plot, and a violin plot. Oddly, it appears that the mushrooms in the f category lack stems entirely and therefore have no height or width. Besides the f category, the swollen stem roots had the highest average height, followed by rooted, club, and then bulbous stem roots. The only difference when ranking by maximum values is that the order of club and bulbous is flipped.
Stem root was compared to stem width using a bar plot, a box plot with an overlaid strip plot, and a violin plot. Besides the f category, the club stem roots had the highest average width, followed by swollen, rooted, and then bulbous stem roots. The only difference when ranking by maximum values is that the order of rooted and bulbous is flipped. It is interesting that bulbous stem roots have the smallest average width and height.
# Creates a list of all of the supervised learning algorithms that will be tested
models = []
models.append(('LR', LogisticRegression(solver = 'liblinear', multi_class = 'ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma = 'auto')))
models.append(('CART', DecisionTreeClassifier()))
models.append(('KNN', KNeighborsClassifier()))
# Creates a list of all the results of each of the supervised learning algorithms that were tested and their names
results = []
names = []
for name, model in models:
kfold = StratifiedKFold(n_splits = 10, random_state = 1, shuffle = True)
cv_results = cross_val_score(model, X_train, y_train, cv = kfold, scoring = 'accuracy')
results.append(cv_results)
names.append(name)
print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
LR: 0.845522 (0.001647)
LDA: 0.837908 (0.001529)
NB: 0.328278 (0.004411)
SVM: 0.997974 (0.000575)
CART: 0.998629 (0.000648)
KNN: 0.999775 (0.000143)
# Creates a dictionary of all of the names of the supervised learning algorithms that were tested and their mean accuracies
answers = {}
for i in range(len(names)):
answers[names[i]] = np.mean(results[i])
# Creates a bar plot from the dictionary
plt.figure(figsize = (12, 10))
plt.bar(height = list(answers.values()), x = list(answers.keys()), color = sns.color_palette('PiYG'))
plt.title('Comparison of Algorithms', fontsize = 16)
plt.xlabel('Algorithm', fontsize = 14)
plt.ylabel('Average Accuracy (%)', fontsize = 14)
In order to evaluate the supervised learning algorithms, the algorithms first had to be created. The algorithms created are: logistic regression, linear discriminant analysis, naïve Bayes, support vector machine, classification and regression tree, and k-nearest neighbors. Each of these algorithms attempted to predict whether a given mushroom would bruise/bleed or not. K-nearest neighbors was the most accurate, with a 99.98% accuracy rate. However, the classification and regression tree and the support vector machine came very close, with accuracy rates of 99.86% and 99.80%, respectively. There was a steep drop off to the logistic regression and linear discriminant analysis algorithms, with accuracies of only 84.55% and 83.79%, respectively, and another steep drop off to the naïve Bayes algorithm at only 32.83%. All of the algorithms had very low standard deviations, and all of them were visualized in a bar chart.
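Cross-validation scores the models on folds of the training data only; as a sanity check, the top model could also be fit once and scored on the held-out test set. A sketch on synthetic data (toy features and a toy target; the real analysis would pass the scaled X_train/X_test from above):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled mushroom features
rng = np.random.default_rng(0)
X = rng.normal(size = (200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy target depending on two features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.2, random_state = 0)

# k-nearest neighbors, the top scorer in the cross-validation above
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print(round(accuracy, 3))
```

Agreement between the cross-validation mean and this held-out score would suggest the model is not overfitting the training folds.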
# Adding all of the layers to the artificial neural network
classifier = Sequential()
classifier.add(Dense(activation = 'relu', input_dim = 20, units = 11, kernel_initializer = 'uniform'))
classifier.add(Dense(activation = 'relu', units = 11, kernel_initializer = 'uniform'))
classifier.add(Dense(activation = 'sigmoid', units = 1, kernel_initializer = 'uniform'))
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
# Creates a list of all of the fits and results from the artificial neural network; note that the
# same model is refit each round, so training is cumulative (1, then 1 + 2 = 3, then 3 + 3 = 6
# total epochs, and so on)
fits = []
accuracies = []
for i in range(1, 11):
    print('Number of Epochs:', i, '\n\nFitting:')
    fits.append(classifier.fit(X_train, y_train, batch_size = 5, epochs = i, verbose = 2))
    print('\nConfusion Matrix:')
    y_pred = (classifier.predict(X_test, verbose = 0) > 0.5)
    cm = confusion_matrix(y_test, y_pred)
    print(cm, '\n\nAccuracy: ')
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tn + tp) / (tn + tp + fn + fp)
    accuracies.append(accuracy)
    print(accuracy, '\n\n========================================================================\n')
[Training log condensed. Test-set accuracy after each fitting round:]

Epochs  Accuracy
1       0.9846
2       0.9970
3       0.9962
4       0.9971
5       0.9983
6       0.9989
7       0.9983
8       0.9975
9       0.9998
10      0.9991
# Creates a dictionary of all of the epoch counts that were tested and their accuracies
accuracies_dict = {}
for i in range(len(accuracies)):
    accuracies_dict[i + 1] = accuracies[i]
# Creates a bar plot from the dictionary
plt.figure(figsize = (12, 10))
sns.barplot(y = list(accuracies_dict.values()), x = list(accuracies_dict.keys()), palette = 'PiYG')
plt.ylim(0.98, 1)
plt.title('Comparison of Artificial Neural Network Epoch Count', fontsize = 16)
plt.xlabel('Epoch Count', fontsize = 14)
plt.ylabel('Accuracy', fontsize = 14)
# Fits the artificial neural network with nine epochs
print('Number of Epochs: 9\n\nFitting:')
fits.append(classifier.fit(X_train, y_train, batch_size = 5, epochs = 9, verbose = 2))
print('\nConfusion Matrix:')
y_pred = (classifier.predict(X_test, verbose = 0) > 0.5)
cm = confusion_matrix(y_test, y_pred)
print(cm, '\n\nAccuracy: ')
tn, fp, fn, tp = cm.ravel()
accuracy = (tn + tp) / (tn + tp + fn + fp)
# Tests the fit on three example mushrooms
print(accuracy, '\n\nTest Mushroom 1:')
new_mushroom = [[1, 0.73, 6, 2, 5, 0, -1, 5, 3.72, 0.97, -1, 2, 4, -1, -1, 0, 1, -1, 0, 1]]
new_mushroom = sc.transform(new_mushroom)
new_prediction = classifier.predict(new_mushroom, verbose = 0)
print(new_prediction, '\n\nTest Mushroom 2:')
new_mushroom = [[0, 13.43, 6, -1, 5, -1, -1, 10, 12.47, 20.63, 0, -1, 11, 0, 4, 1, 2, -1, 0, 2]]
new_mushroom = sc.transform(new_mushroom)
new_prediction = classifier.predict(new_mushroom, verbose = 0)
print(new_prediction, '\n\nTest Mushroom 3:')
new_mushroom = [[0, 9.48, 5, -1, 5, 5, 0, 3, 8.62, 22.77, 0, -1, 10, -1, -1, 0, 1, -1, 1, 3]]
new_mushroom = sc.transform(new_mushroom)
new_prediction = classifier.predict(new_mushroom, verbose = 0)
print(new_prediction)
[Training log condensed.]
Number of Epochs: 9
Confusion Matrix: [[10083 1] [0 2130]]
Accuracy: 0.9999181267398067
Test Mushroom 1: [[7.344817e-09]]
Test Mushroom 2: [[0.99999857]]
Test Mushroom 3: [[8.545989e-15]]
In order to analyze the quality of the artificial neural networks, the networks first had to be created. Ten fits were produced, one for each epoch count from one to ten, all with a batch size of five. Each attempted to predict whether a given mushroom bruises or bleeds. The most accurate fit used nine epochs and reached an accuracy of 99.98%, nearly the best of any algorithm created; only the k-nearest neighbors algorithm slightly beat it. All of the fits were remarkably accurate, every one exceeding 99% accuracy except the one-epoch fit, which landed in the 98% range. Interestingly, the eight-epoch fit produced zero false positives. The results were visualized in a bar chart, and the most accurate configuration, with nine epochs, was tested on three example mushrooms; it correctly predicted whether each of them bruises or bleeds.
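The accuracy figures above come from unpacking each confusion matrix with `ravel()`. As a sanity check, that hand computation agrees with scikit-learn's built-in `accuracy_score`; the labels below are made up purely for illustration.

```python
# Verify that accuracy derived from confusion-matrix cells matches
# sklearn's accuracy_score. The labels here are illustrative only.
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred)
# For binary labels, ravel() returns cells in (tn, fp, fn, tp) order.
tn, fp, fn, tp = cm.ravel()
accuracy = (tn + tp) / (tn + tp + fn + fp)

print(accuracy, accuracy_score(y_true, y_pred))  # both 0.75
```

Keeping the confusion matrix around, rather than just the scalar accuracy, is what made observations like the zero-false-positive eight-epoch fit possible.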
# Scales the data frame
X_scaled = sc.fit_transform(X)
# Creates a list of all of the WCSS scores on the number of clusters
score_1 = []
for i in range(1, 20):
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(X_scaled)
    score_1.append(kmeans.inertia_)
# Creates a line plot from the list
plt.figure(figsize = (12, 10))
plt.plot(range(1, 20), score_1, 'x-', color = '#4dac26')
plt.title('WCSS vs. Number of Clusters', fontsize = 16)
plt.xlabel('Number of Clusters', fontsize = 14)
plt.ylabel('WCSS Score', fontsize = 14)
# Creates bar plots for each of the variables in each of the clusters
kmeans = KMeans(n_clusters = 7, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
labels = kmeans.fit_predict(X_scaled)
X_cluster = pd.concat([X, pd.DataFrame({'cluster': labels})], axis = 1)
for i in X_cluster.columns:
    plt.figure(figsize = (35, 5))
    for j in range(7):
        plt.subplot(1, 8, j + 1)
        cluster = X_cluster[X_cluster['cluster'] == j]
        cluster[i].hist(bins = 20, color = '#4dac26')
        plt.title('{}\nCluster {}'.format(i, j + 1))
# Creates a principal component analysis of the clusters
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
principal_comp = pca.fit_transform(X_scaled)
pca_X = pd.DataFrame(data = principal_comp, columns = ['pca1', 'pca2'])
pca_X = pd.concat([pca_X, pd.DataFrame({'cluster': labels})], axis = 1)
# Creates a dot plot of the data frame
plt.figure(figsize = (12, 10))
sns.scatterplot(x = 'pca1', y = 'pca2', hue = 'cluster', data = pca_X, palette = 'PiYG', s = 1, edgecolor = 'k', linewidth = 0.05)
plt.title('Clusters', fontsize = 16)
plt.xlabel('PCA1', fontsize = 14)
plt.ylabel('PCA2', fontsize = 14)
In order to analyze the clusters made by the k-means clustering algorithm, the algorithm first had to be created. To choose the number of clusters, an elbow plot of WCSS against cluster count was made; the elbow fell at seven, so seven clusters were used. The histograms created for each variable in each cluster do not make it entirely clear how the algorithm grouped the mushrooms, but there are some standouts. In particular, every mushroom in cluster six has a universal veil, while the mushrooms in all of the other clusters have no veils at all. Furthermore, cluster two had the widest ranges of all of the numeric data: the tallest stems, thickest stems, and widest caps were all in cluster two. Lastly, a scatter plot based on a principal component analysis of the clusters was created to visualize their groupings.
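The elbow method used above is one heuristic for picking a cluster count; the silhouette score is a common complement that rewards tight, well-separated clusters. A minimal sketch on synthetic blob data (the data and the candidate range of k are assumptions for illustration):

```python
# Sketch: validating a cluster count with the silhouette score, as a
# complement to the elbow method. Synthetic blob data is an assumption.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_blobs, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Silhouette needs at least two clusters, so the sweep starts at k = 2.
scores = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(X_blobs)
    scores[k] = silhouette_score(X_blobs, labels)

best_k = max(scores, key=scores.get)
print('Best k by silhouette:', best_k)
```

Unlike WCSS, which always decreases as k grows, the silhouette score peaks at a specific k, so it does not require eyeballing an elbow.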
In conclusion, not many correlations were found between the few numeric values in this dataset. The primary correlation found was between cap diameter and stem width; cap diameter was only weakly correlated with stem height, and stem height was only weakly correlated with stem width. This makes sense: a wider stem is needed to hold up a wider cap, but a longer stem is not, and stem width and height do not constrain each other. It was also determined that the "others" category of cap shapes has the widest average caps of all cap shapes, while the bell cap shape has the narrowest. Furthermore, the fleshy cap surface was shown to have the highest average cap diameter among the cap surface types, while grooves had the lowest. Swollen roots clearly have the tallest stems of all the root types, while club roots have the widest stems.
In terms of machine learning, it has become clear that supervised learning algorithms are optimal for this dataset. The unsupervised k-means clustering algorithm grouped the mushrooms in ways that were difficult to interpret, but most of the supervised learning algorithms successfully predicted whether a mushroom would bruise or bleed. In particular, the k-nearest neighbors algorithm was the best performer, with the nine-epoch artificial neural network not far behind. None of the supervised learning algorithms performed particularly badly besides the naïve Bayes algorithm, with its 32.83% accuracy. The logistic regression and linear discriminant analysis algorithms were also not quite as good as the others, but both were still roughly 85% accurate. The artificial neural networks, support vector machine, classification and regression tree, and k-nearest neighbors algorithms were all extremely successful.